Prompt: 3

ChatGPT/AI disclosure statement: I used ChatGPT to fix some issues as mentioned below in the code.

1. Introduction

The last time Arsenal won the Premier League title was in the 2003-2004 season.

As one of the top clubs in the world, enhancing the team’s performance and increasing its influence while maintaining financial health has been a sustained goal for the club’s management.

Based on this, I collected data from every Arsenal match over the past ten seasons, along with the transfer and player signing data, in order to analyse the relationship between the club’s expenditure and performance each season, as well as their return on investment in player transfers.

2. Obtaining Arsenal’s Match Statistics

The primary data includes Arsenal’s match statistics over the past ten seasons, including goals scored, possession percentages, attendance figures, and so on. These metrics directly reflect the team’s performance and influence, forming the core foundation of this report’s analysis.

I collected this data from FBref, a publicly available football statistics website. On this website, each season’s matchlog is organised into a table. I went through all matchlog pages from last 10 season, and used rvest to automatically scrape these tables.

When knitting the Rmd, I was blocked by the website for an hour because of triggering its rate limit. After the block, I increased the delay to 3 seconds.

library(tidyverse)
library(rvest)

# Define a function that generalises to all seasons
scrape_and_save_table <- function(season, folder_path) {
  
 # General form of url for different seasons
  url <- paste0("https://fbref.com/en/squads/18bb7c10/", season, "/matchlogs/c9/schedule/Arsenal-Scores-and-Fixtures-Premier-League")
  
  page <- read_html(url)
    
    # Extract the table on the website that contains matchlog
    table_data <- page %>%
      html_node(xpath = '//*[@id="matchlogs_for"]') %>% 
      html_table(fill = TRUE) 
    
    # Save as CSV in the folder
    file_name <- paste0("Arsenal_", season, ".csv")
    file_path <- file.path(folder_path, file_name)
    write.csv(table_data, file_path, row.names = FALSE)
}

# Create the "primary data" folder path
folder_path <- "/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/primary data"

# Create the folder if it does not exist
if (!dir.exists(folder_path)) {
  dir.create(folder_path)
}

# Define the list of seasons
seasons <- paste0(2014:2023, "-", 2015:2024)

# Loop through each season, scrape and save the data
for (season in seasons) {
  scrape_and_save_table(season, folder_path)
  Sys.sleep(3)
}

3. Obtaining Arsenal’s Transfer Record

The secondary data includes the club’s transfer record, such as income, expenditure, and overall balance.

Moreover, I gathered Arsenal’s seasonal performance record as a supplementary dataset to the primary data on the team’s performance.

I also collected data on Arsenal’s new signings for each season to analyse the performance of the most expensive player signed in that season.

I obtained this data from the Transfermarkt website. During data collection process, I noticed the website had some restrictions on web scraping. I was unable to directly scrape data from the “transfers” page due to a persistent popup that could not be bypassed. To overcome this, I used RSelenium to simulate human actions, navigating from the website’s homepage to the “transfers” page for scraping.

# Since I have difficulty when trying to knit this Rmd, I use "eval=FALSE" in this part.
library(RSelenium)
library(rvest)
library(tidyverse)

url <- "https://www.transfermarkt.co.uk/"

# Start the Selenium server
rD <- rsDriver(browser = c("firefox"), verbose = FALSE, port = netstat::free_port(random = TRUE), chromever = NULL)
driver <- rD$client

# Navigate to the website
driver$navigate(url)
Sys.sleep(2)

# The cookie pop-up window is on frame 1
driver$switchToFrame(1)

# Click the Accept button
accept_button <- driver$findElement(using = "xpath", value = '/html/body/div/div[2]/div[3]/div[1]/div/button') 
accept_button$clickElement()
Sys.sleep(1) 

# Switch back to default frame
driver$switchToFrame(NA) 

# Search for Arsenal
search_field <- driver$findElement(using = "xpath", value = '/html/body/div[1]/div[5]/div[2]/form/input')
search_field$sendKeysToElement(list("Arsenal"))
Sys.sleep(1) 

# Click on the search button
search_button <- driver$findElement(using = "xpath", value = '/html/body/div[1]/div[5]/div[2]/form/button')
search_button$clickElement()
Sys.sleep(1) 

# Find Arsenal FC
arsenal_link <- driver$findElement(using = "xpath", value = '/html/body/div[1]/main/div[2]/div/div/div/div/table/tbody/tr[1]/td[2]/table/tbody/tr[1]/td')
arsenal_link$clickElement()
Sys.sleep(1) 

# Find all transfers
all_transfers_link <- driver$findElement(using = "xpath", value = '/html/body/div[1]/main/div[1]/div[2]/div[3]/a')
all_transfers_link$clickElement()
Sys.sleep(1) 
# Create the secondary data foler
if (!dir.exists("secondary data")) {
  dir.create("secondary data")
}

# Loop through the seasons from 14-15 to 23-24
for (i in 14:23) {
  season_folder <- paste0("secondary data/", i, "-", i + 1))
  
  # Create a folder for the current season
  if (!dir.exists(season_folder)) {
    dir.create(season_folder)
  }
 
  # Create the general xpath for each season (originally the general xpath should be paste0("//*[@id="selSM7_chzn_o_', 26 - i, '"]"). But the three numbers or letters after "sel" seem to be random and keep changing
  season_xpath <- paste0("//*[contains(@id, 'chzn_o_') and contains(@id, '_", 26 - i, "')]")

  # Filter by season
  filter_by_season <- driver$findElement(using = "xpath", value = '/html/body/div[1]/main/div[1]/div[1]/div[1]/div/form/div/div/table/tbody/tr[1]/td[2]/div/div/a')
  filter_by_season$clickElement()
  Sys.sleep(2)
  
  # Choose the season
  the_season <- driver$findElement(using = "xpath", value = season_xpath)
  the_season$clickElement()
  Sys.sleep(2)
  
  # Click on the Display button
  dispay_button <- driver$findElement(using = "xpath", value = '/html/body/div[1]/main/div[1]/div[1]/div[1]/div/form/div/div/table/tbody/tr[4]/td[3]/input')
  dispay_button$clickElement()
  Sys.sleep(2)
  
  # Extract the page source
  page_source <- driver$getPageSource()[[1]]
  
  # Parse the page source
  page_html <- read_html(page_source)
  
  # Extract the transfer record
  transfer_record <- page_html %>%
    html_element(xpath = "/html/body/div[1]/main/div[1]/div[2]/div[2]/table") %>%
    html_table(fill = TRUE)
  
  # Extract the season record
  season_record <- page_html %>%
    html_element(xpath = "/html/body/div[1]/main/div[1]/div[2]/div[3]/table") %>%
    html_table(fill = TRUE)
  
  # Extract the signing of the season
  signing <- page_html %>%
    html_element(xpath = "/html/body/div[1]/main/div[1]/div[1]/div[2]/div[2]/div/table") %>%
    html_table(fill = TRUE)
  
  # Save the three tables as CSV files
  write_csv(transfer_record, file = paste0(season_folder, "/transfer_record.csv"))
  write_csv(season_record, file = paste0(season_folder, "/season_record.csv"))
  write_csv(signing, file = paste0(season_folder, "/signing.csv"))
  
}
# Stop the Selenium driver
driver$close()
rD$server$stop()

4. Tabular Data and Transformations

4.1 Tabular data

First, I built tidy tabular data based on the data I scraped from the website.

(1)The primary data is tidy. Because each variable has its own column; each observation (each match) has its own row; each value has its own cell.

I printed the first few rows for the 2014-2015 season as an example:

library(tidyverse)
primary_14 <- read_csv("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/primary data/Arsenal_2014-2015.csv", show_col_types = FALSE)
head(primary_14)
## # A tibble: 6 × 17
##   Date       Time   Round       Day   Venue Result    GF    GA Opponent     Poss
##   <date>     <time> <chr>       <chr> <chr> <chr>  <dbl> <dbl> <chr>       <dbl>
## 1 2014-08-16 17:30  Matchweek 1 Sat   Home  W          2     1 Crystal Pa…    76
## 2 2014-08-23 17:30  Matchweek 2 Sat   Away  D          2     2 Everton        54
## 3 2014-08-31 16:00  Matchweek 3 Sun   Away  D          1     1 Leicester …    68
## 4 2014-09-13 12:45  Matchweek 4 Sat   Home  D          2     2 Manchester…    45
## 5 2014-09-20 15:00  Matchweek 5 Sat   Away  W          3     0 Aston Villa    71
## 6 2014-09-27 17:30  Matchweek 6 Sat   Home  D          1     1 Tottenham      69
## # ℹ 7 more variables: Attendance <dbl>, Captain <lgl>, Formation <chr>,
## #   `Opp Formation` <chr>, Referee <chr>, `Match Report` <chr>, Notes <lgl>

(2)As for secondary data, the season record and transfer record is tidy.

I printed the first few rows for the 2014-2015 season as an example:

season_record_14 <- read_csv("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/14-15/season_record.csv", show_col_types = FALSE)
head(season_record_14)
## # A tibble: 6 × 3
##   Competition...1 Competition...2       Achieve         
##   <lgl>           <chr>                 <chr>           
## 1 NA              League Cup            Third Round     
## 2 NA              UEFA Champions League Round of 16     
## 3 NA              Champions League Qu.  Qualifying Round
## 4 NA              FA Cup                <NA>            
## 5 NA              Premier League        3th             
## 6 NA              Community Shield      <NA>
transfer_record_14 <- read_csv("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/14-15/transfer_record.csv", show_col_types = FALSE)
head(transfer_record_14)
## # A tibble: 3 × 3
##   ...1            `Arrivals/Departures` Fee     
##   <chr>           <chr>                 <chr>   
## 1 Income          21                    €27.80m 
## 2 Expenditure     18                    €118.98m
## 3 Overall balance Overall balance       €-91.18m

The signing data is untidy. My goal is to obtain a tidy table containing the players’ names, position, and their transfer fee.

First I cleaned the signing data from the 2014-2015 season as an example:

library(tidyverse)
signing_14 <- read_csv("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/14-15/signing.csv")

cleaned_signing_14 <- signing_14 %>%
  # Remove the first three rows
  slice(-(1:3)) %>%
  # Remove the NA rows in Nat. column
  filter(!is.na(Nat.)) %>%
  # Remain the two columns I need
  select(Nat., Left, `...12`) %>%
  # Rename the column
  rename(
    Player = Nat.,
    Position = Left,
    Fee = `...12`
  )

print(cleaned_signing_14)
## # A tibble: 18 × 3
##    Player            Position           Fee                    
##    <chr>             <chr>              <chr>                  
##  1 Alexis Sánchez    Centre-Forward     €42.50m                
##  2 Calum Chambers    Centre-Back        €20.23m                
##  3 Danny Welbeck     Centre-Forward     €20.00m                
##  4 Mathieu Debuchy   Right-Back         €15.00m                
##  5 Gabriel Paulista  Centre-Back        €15.00m                
##  6 David Ospina      Goalkeeper         €4.00m                 
##  7 Krystian Bielik   Centre-Back        €2.25m                 
##  8 Héctor Bellerín   Right-Back         -                      
##  9 Joel Campbell     Right Winger       End of loanJun 30, 2014
## 10 Johan Djourou     Centre-Back        End of loanJun 30, 2014
## 11 Francis Coquelin  Defensive Midfield End of loanDec 12, 2014
## 12 Francis Coquelin  Defensive Midfield End of loanJun 30, 2014
## 13 Yaya Sanogo       Centre-Forward     End of loanMay 31, 2015
## 14 Wellington Silva  Right Winger       End of loanMay 31, 2015
## 15 Emiliano Martínez Goalkeeper         End of loanMay 4, 2015 
## 16 Wellington Silva  Right Winger       End of loanJun 30, 2014
## 17 Kristoffer Olsson Central Midfield   End of loanJan 3, 2015 
## 18 Semi Ajayi        Centre-Back        End of loanMay 31, 2015

Then I generalised to all signing data in the secondary_data folder, and saved the cleaned_signing files into the folder.

# Define the path to all folders
folders <- paste0("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/", 14:23, "-", 15:24)

for (folder in folders) {
  
  file_path <- file.path(folder, "signing.csv")
  
  # Read and clean the data
  cleaned_signing_data <- read_csv(file_path) %>%
    # Remove the first three rows
    slice(-(1:3)) %>%
    # Remove the NA rows in Nat. column
    filter(!is.na(Nat.)) %>%
    # Remain the two columns I need
    select(Nat., Left, `...12`) %>%
    # Rename the column
    rename(
      Player = Nat.,
      Position = Left,
      Fee = `...12`
    )
  
  # Generate the path of the output data
  output_path <- file.path(folder, "cleaned_signing.csv")
  
  # Save the cleaned data
  write_csv(cleaned_signing_data, output_path)
}

4.2 Five Transformations

Next, I make five transformations of the data:

(1)Season performance

I created a table with two columns to evaluate Arsenal’s overall performance in Premier League each season. The two columns are: Season and Ranking.

# Define the path to all folders
folders <- paste0("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/", 14:23, "-", 15:24)

# Generate seasons
seasons <- paste0(2014:2023, "-", 2015:2024)

# Initialise an empty data frame for results
season_performance <- data.frame(Season = character(), 
                          Ranking = numeric()
                         )

# Loop through each folder and process the data
for (i in 1:length(folders)) {
  
  folder <- folders[i]
  file_path <- file.path(folder, "season_record.csv")
  
  # Read the CSV file
  data <- read_csv(file_path, show_col_types = FALSE)
  
  # Extract the ranking
  # There are two special cases
  if (seasons[i] == "2014-2015") {
    ranking <- as.numeric(gsub("(st|nd|rd|th)", "", data$Achieve[5]))
  } else if (seasons[i] == "2021-2022") {
    ranking <- as.numeric(gsub("(st|nd|rd|th)", "", data$Achieve[3]))
  } else {
  ranking <- as.numeric(gsub("(st|nd|rd|th)", "", data$Achieve[4]))  
  } 
  # Add the results to the empty frame
  season_performance <- bind_rows(season_performance, 
                          tibble(Season = seasons[i], 
                                 Ranking = ranking))
}

print(season_performance)
##       Season Ranking
## 1  2014-2015       3
## 2  2015-2016       2
## 3  2016-2017       5
## 4  2017-2018       6
## 5  2018-2019       5
## 6  2019-2020       8
## 7  2020-2021       8
## 8  2021-2022       5
## 9  2022-2023       2
## 10 2023-2024       2
# Save the table to the derivative data folder
if (!dir.exists("derivative data")) {
  dir.create("derivative data")
}
write.csv(season_performance, "derivative data/season_performance.csv")

(2)Match performance

I calculated the win rate, total goals scored, total goals conceded, and average possession rate for each season and created a table with five columns to evaluate Arsenal’s match performance in each season. The five columns are: Season, Win_Rate, Total_Goals_Scored, Total_Goals_Conceded, and Average_Possession_Rate.

# Define the path to all folders
files <- paste0("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/primary data/Arsenal_", 2014:2023, "-", 2015:2024, ".csv")

# Generate seasons
seasons <- paste0(2014:2023, "-", 2015:2024)

# Initialise an empty data frame for results
match_performance <- data.frame(Season = character(), 
                           Win_Rate = numeric(),
                           Total_Goals_Scored = numeric(),
                           Total_Goals_Conceded = numeric(),
                           Average_Possession_Rate = numeric()
                         )

# Loop through each folder and process the data
for (i in 1:length(files)) {
  
  file <- files[i]
  data <- read_csv(file)
  
  # Calculate the win rate, total goals scored, total goals conceded, and average possesion rate
  win_rate <- round(sum(data$Result == "W") / 38 * 100, 2)
  total_goals_scored <- sum(data$GF)
  total_goals_conceded <- sum(data$GA)
  average_possession_rate <- round(mean(data$Poss), 2)

  # Add the results to the empty frame
  match_performance <- bind_rows(match_performance, 
                          tibble(Season = seasons[i], 
                                 Win_Rate = win_rate,
                                 Total_Goals_Scored = total_goals_scored,
                                 Total_Goals_Conceded = total_goals_conceded,
                                 Average_Possession_Rate = average_possession_rate
                                ))
}

print(match_performance)
##       Season Win_Rate Total_Goals_Scored Total_Goals_Conceded
## 1  2014-2015    57.89                 71                   36
## 2  2015-2016    52.63                 65                   36
## 3  2016-2017    60.53                 77                   44
## 4  2017-2018    50.00                 74                   51
## 5  2018-2019    55.26                 73                   51
## 6  2019-2020    36.84                 56                   48
## 7  2020-2021    47.37                 55                   39
## 8  2021-2022    57.89                 61                   48
## 9  2022-2023    68.42                 88                   43
## 10 2023-2024    73.68                 91                   29
##    Average_Possession_Rate
## 1                    56.63
## 2                    58.34
## 3                    58.92
## 4                    61.45
## 5                    58.13
## 6                    53.76
## 7                    53.50
## 8                    52.76
## 9                    59.32
## 10                   58.18
# Save the table to the derivative data folder
write.csv(match_performance, "derivative data/match_performance.csv")

(3)Home influence

I calculated the average attendance at home for each season and created a table with two columns to evaluate Arsenal’s home influence. The two columns are: Season and Average_Attendance.

# Initialise an empty data frame for results
home_influence <- data.frame(Season = character(), 
                          Average_Attendance = numeric()
                         )

# Loop through each folder and process the data
for (i in 1:length(files)) {
  
  file <- files[i]
  data <- read_csv(file)
  
  # Calculate the average attendance at home
  average_attendance <- round(mean(data$Attendance[!is.na(data$Attendance) & data$Venue == "Home"]), 0)

  # Add the results to the empty frame
  home_influence <- bind_rows(home_influence, 
                          tibble(Season = seasons[i], 
                                Average_Attendance = average_attendance
                                ))
}

print(home_influence)
##       Season Average_Attendance
## 1  2014-2015              59992
## 2  2015-2016              59944
## 3  2016-2017              59957
## 4  2017-2018              59323
## 5  2018-2019              59899
## 6  2019-2020              60279
## 7  2020-2021               6000
## 8  2021-2022              59665
## 9  2022-2023              60191
## 10 2023-2024              60236
# Save the table to the derivative data folder
write.csv(home_influence, "derivative data/home_influence.csv")

Attendance during the 2020-2021 season was unusually low due to COVID-19, with most matches played without spectators and only a few allowing limited fans.

(4)Financial condition

I created a table with four columns to evaluate Arsenal’s financial condition. The four columns are: Season, Income, Expenditure, and Overall_Balance.

# Define the path to all folders
folders <- paste0("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/", 14:23, "-", 15:24)

# Generate seasons
seasons <- paste0(2014:2023, "-", 2015:2024)

# Initialise an empty data frame for results
financial_condition <- data.frame(Season = character(), 
                         `Income/m€` = numeric(), 
                         `Expenditure/m€` = numeric(), 
                         `Overall_Balance/m€` = numeric()
                         )

# Loop through each folder and process the data
for (i in 1:length(folders)) {
  folder <- folders[i]
  file_path <- file.path(folder, "transfer_record.csv")
  
  # Read the CSV file
  data <- read_csv(file_path, show_col_types = FALSE)
  
  # Extract income, expenditure, and overall balance
  income <- as.numeric(gsub("[€m]", "", data$Fee[1]))  
  expenditure <- as.numeric(gsub("[€m]", "", data$Fee[2])) 
  overall_balance <- as.numeric(gsub("[€m]", "", data$Fee[3])) 

  # Add the results to the empty frame
  financial_condition <- bind_rows(financial_condition, 
                          tibble(Season = seasons[i], 
                                 `Income/m€` = income, 
                                 `Expenditure/m€` = expenditure, 
                                 `Overall_Balance/m€` = overall_balance)) %>%
                         select(Season, `Income/m€`, `Expenditure/m€`, `Overall_Balance/m€`)
}

print(financial_condition)
##       Season Income/m€ Expenditure/m€ Overall_Balance/m€
## 1  2014-2015     27.80         118.98             -91.18
## 2  2015-2016      2.50          26.50             -24.00
## 3  2016-2017     10.35         113.00            -102.65
## 4  2017-2018    162.40         152.85               9.55
## 5  2018-2019      9.10          80.15             -71.05
## 6  2019-2020     53.65         160.80            -107.15
## 7  2020-2021     19.15          86.00             -66.85
## 8  2021-2022     31.40         167.40            -136.00
## 9  2022-2023     23.80         186.40            -162.60
## 10 2023-2024     69.20         235.10            -165.90
# Save the table to the derivative data folder
write.csv(financial_condition, "derivative data/financial_condition.csv")

(5)Most expensive signings

To analyse the return on investment in player transfers, I decided to collecte data on each season’s most expensive signing and their performance.

First, I identified the most expensive player signed that season from the “cleaned_signing.csv” file. Then, I retrieved the corresponding player’s match performance statistics from FBref.

# Define the path to all folders
folders <- paste0("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/", 14:23, "-", 15:24)

# Exclude the "15-16" folder, as the character of the player's name is special, different from the cleaned_signing.csv
folders <- folders[!grepl("15-16", folders)]

for (folder in folders) {

  csv_file <- file.path(folder, "cleaned_signing.csv")

  data <- read.csv(csv_file)
  
  # Extract the name of the most expensive player
  player_name <- data$Player[1]
  
  ### * the following line of code were generated by ChatGPT: 
  # Here I asked ChatGPT how to extract the year part from the folder path
  folder_basename <- basename(folder)  
  year_start <- as.numeric(str_extract(folder_basename, "^\\d{2}")) + 2000 
  year_end <- year_start + 1
  season <- paste0(year_start, "-", year_end)  
  
  # Create the general url
  url <- paste0("https://fbref.com/en/squads/18bb7c10/", season, "/roster/Arsenal-Roster-Details")
  
  page <- read_html(url)
  
  # Define the xpath of the player's statistics
  content_xpath <- paste0("//h2[a[text()='", player_name, "']]/following-sibling::div[2]//table")
  
  # Extract the player's statistics
  content_data <- page %>%
    html_nodes(xpath = content_xpath) %>%
    html_table(fill = TRUE)
  
    # Extract the table
    content_df <- content_data[[1]]
    
    # Set the first row as column names
    colnames(content_df) <- content_df[1, ]
    content_df <- content_df[-1, ]

    # Save the files to the folder
    output_file <- file.path(folder, "most_expensive_signing_data.csv")
    write.csv(content_df, output_file, row.names = FALSE)
    
    Sys.sleep(3)

}

Since the Xpath of the most expensive player from 2015-2016 is special, I extracted it separately.

page <- read_html("https://fbref.com/en/squads/18bb7c10/2015-2016/roster/Arsenal-Roster-Details")
  
# Define the special XPath
content_xpath <- paste0("//h2[a[text()='Petr Čech']]/following-sibling::div[2]//table")
  
# Extract Petr Čech's statistics
content_data <- page %>%
  html_nodes(xpath = content_xpath) %>%
  html_table(fill = TRUE)
  
# Extract the table
content_df <- content_data[[1]]

# Set the first row as column names
colnames(content_df) <- content_df[1, ]
content_df <- content_df[-1, ]

# Save the file to the folder
output_file <- file.path("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/15-16/most_expensive_signing_data.csv")
write.csv(content_df, output_file, row.names = FALSE)

Next, based on the extracted data, I calculated the player’s scoring efficiency (average minutes played per goal) and created a table with six columns. The six columns are: Season, Player, Position, Minutes_Played, Transfer_Fee, and Scoring_Efficiency.

# Define the path to all folders
folders <- paste0("/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/secondary data/", 14:23, "-", 15:24)

# Generate seasons
seasons <- paste0(2014:2023, "-", 2015:2024)

# Initialise an empty data frame for results
most_expensive_signing <- data.frame(Season = character(), 
                          Player = character(),
                          Position = character(),
                          Minutes_Played = numeric(),
                         `Transfer_Fee/m€` = numeric(),
                         `Scoring_Efficiency/(min/gls)` = numeric()
                         )

# Loop through each folder and process the data
for (i in 1:length(folders)) {
  
  folder <- folders[i]
  signing_file_path <- file.path(folder, "cleaned_signing.csv")
  player_file_path <- file.path(folder, "most_expensive_signing_data.csv")
  
  # Read the CSV file
  sigining_data <- read_csv(signing_file_path, show_col_types = FALSE)
  player_data <- read_csv(player_file_path, show_col_types = FALSE)
  
  # Extract the player's name, transfer fee
  player <- sigining_data$Player[1]  
  position <- sigining_data$Position[1]
  transfer_fee <- as.numeric(gsub("[€m]", "", sigining_data$Fee[1])) 
  
  # Check if Gls exists (One of the player is goalkeeper, and doesn't have goals)
  if (!("Gls...8" %in% colnames(player_data))) {
    scoring_efficiency <- NA
    player_data <- player_data %>%
     # Match the corresponding season
     filter(Season == seasons[i]) %>%  
     mutate(
       Min = as.numeric(gsub(",", "", Min...7))
     )
    minutes_played <- player_data$Min[1]
  } else {
   # Clean Min and Gls columns in player_data for calculations
   player_data <- player_data %>%
     # Match the corresponding season
     filter(Season == seasons[i]) %>%  
     mutate(
       # Here I checked the column names and found there are some special characters
       Min = as.numeric(gsub(",", "", Min...7)), 
       Gls = as.numeric(Gls...8)                
     )
   
  # Extract the player's minutes played
  minutes_played <- player_data$Min[1]
  
  # Calculate player's scoring efficiency
  # If Gls is NA, set scoring_efficiency to NA
  scoring_efficiency <- ifelse(
    is.na(player_data$Gls[1]) || player_data$Gls[1] == 0,
    NA, 
    round(player_data$Min[1] / player_data$Gls[1], 2)
  )
 } 
  
  # Add the results to the empty frame
  most_expensive_signing <- bind_rows(most_expensive_signing, 
                          tibble(Season = seasons[i], 
                                 Player = player,
                                 Position = position,
                                 Minutes_Played = minutes_played,
                                `Transfer_Fee/m€` = transfer_fee,
                                `Scoring_Efficiency/(min/gls)` = scoring_efficiency))

}

 # Remove the unexpected columns
  most_expensive_signing <- most_expensive_signing %>%
    select(
      Season,
      Player,
      Position,
      Minutes_Played,
     `Transfer_Fee/m€`,
     `Scoring_Efficiency/(min/gls)`
   )
  
print(most_expensive_signing)
##       Season                    Player           Position Minutes_Played
## 1  2014-2015            Alexis Sánchez     Centre-Forward           2942
## 2  2015-2016                 Petr Cech         Goalkeeper           3060
## 3  2016-2017              Granit Xhaka Defensive Midfield           2485
## 4  2017-2018 Pierre-Emerick Aubameyang     Centre-Forward           1416
## 5  2018-2019            Lucas Torreira Defensive Midfield           2382
## 6  2019-2020              Nicolas Pépé       Right Winger           2010
## 7  2020-2021             Thomas Partey Defensive Midfield            200
## 8  2021-2022                 Ben White         Right-Back           2880
## 9  2022-2023             Gabriel Jesus     Centre-Forward           2064
## 10 2023-2024               Declan Rice   Central Midfield           3225
##    Transfer_Fee/m€ Scoring_Efficiency/(min/gls)
## 1            42.50                       183.88
## 2            14.00                           NA
## 3            45.00                      1242.50
## 4            63.75                       108.92
## 5            28.65                      1191.00
## 6            80.00                       402.00
## 7            50.00                           NA
## 8            58.50                           NA
## 9            52.20                       187.64
## 10          116.60                       460.71
# Save the table to the derivative data folder
write.csv(most_expensive_signing, "derivative data/most_expensive_signing.csv")

5. Data Visualisation

5.1 Arsenal’s Team Performance And Expenditure Over the Past 10 Seasons

Firstly, I attempted to analyse the relationship between Arsenal’s financial expenditures and its overall performance.

I merged data from the tabular data above to create the chart below.

The bar chart represents the team’s expenditures per season, the blue line represents the win rate, and the red line represents Arsenal’s Premier League ranking.

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Combine data from financial_condition,season_performance and match performance
performance_financial_combined <- merge(
  merge(financial_condition, season_performance, by = "Season"),
  match_performance,
  by = "Season"
)

### * the following line of code were generated by ChatGPT: here I asked ChatGPT how to make the ranking line follow the secondary axis scale changes.
adjusted_ranking <- (10 - performance_financial_combined$Ranking) * (250 / 9)

# Create a dual-axis chart
ggplot(performance_financial_combined, aes(x = Season)) +
  # Bar chart for expenditure
  geom_bar(aes(y = `Expenditure/m€`, fill = "Expenditure"), stat = "identity", alpha = 0.7, width = 0.4) +
  # Line chart for win rate
  geom_line(aes(y = Win_Rate, group = 1, color = "Win Rate"), size = 1, linetype = "dashed") +
  geom_point(aes(y = Win_Rate, color = "Win Rate"), size = 2) +
  # Line chart for ranking
  geom_line(aes(y = adjusted_ranking, group = 1, color = "Ranking"), size = 1) +
  geom_point(aes(y = adjusted_ranking, color = "Ranking"), size = 2) +
  # Set legend and colors
  scale_fill_manual(name = "Expenditure", values = c("Expenditure" = "#f4a582")) +
  scale_color_manual(name = "Performance", values = c("Win Rate" = "#2166ac", "Ranking" = "#b2182b")) +
  # Add text to display the ranking values
  geom_text(aes(
    y = adjusted_ranking, 
    label = Ranking,
    color = "Ranking"
  ), vjust = -0.8, size = 3) +
  # Add a second axis
  scale_y_continuous(
    name = "Expenditure (m€) / Win Rate (%)",
    ### * the following line of code were generated by ChatGPT: here I asked ChatGPT how to set a secondary y-axis
    sec.axis = sec_axis(~ -9 / 250 * . + 10, name = "Premier League Ranking", breaks = seq(2, 10, by = 2)) ) +
  # Add labels and titles
  labs(
    title = "Arsenal's Team Performance And Expenditure Over the Past 10 Seasons",
    x = "Season",
    y = "Expenditure (m€) / Win Rate (%)"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "top",
    plot.title = element_text(hjust = 0.5)
  )

5.2 Most Expensive Players: Playing Time and Scoring Efficiency

Next, I aimed to analyse Arsenal’s return on investment in transfers.

I categorised these players into forward, midfield, back, and goalkeeper, with a focus primarily on forwards and midfielders due to their higher goal-scoring roles, while defender’s and goalkeeper’s performance cannot be evaluated by goals scored.

I created the following bar charts for forwards and midfielders, using different colors to indicate their scoring efficiency - darker colors indicating higher efficiency. Additionally, I designed interactivity so that hovering over each bar reveals additional information about the player.

library("plotly")
# Group most expensive players by Position
most_expensive_signing_position <- most_expensive_signing %>%
  mutate(Position = ifelse(Position %in% c("Centre-Forward", "Right Winger"), "Forward",
                   ifelse(Position %in% c("Defensive Midfield", "Central Midfield"), "Midfield",
                   ifelse(Position == "Right-Back", "Back",
                   ifelse(Position == "Goalkeeper", "Goalkeeper", "Other"))))) %>%
  arrange(Position) 

# Filter the data of forward and midfield
forward_data <- most_expensive_signing_position %>%
  filter(Position == "Forward")

midfield_data <- most_expensive_signing_position %>%
  filter(Position == "Midfield")

# Plot for Forward players
forward_plot <- ggplot(forward_data, aes(x = reorder(Player, `Scoring_Efficiency/(min/gls)`), 
                                         y = Minutes_Played, 
                                         fill = `Scoring_Efficiency/(min/gls)`)) +
  geom_bar(stat = "identity", width = 0.5, aes(text = paste(
    "Player: ", Player, "<br>",
    "Transfer Fee:", `Transfer_Fee/m€`, "m€<br>",
    "Minutes Played: ", Minutes_Played, "min<br>",
    "Scoring Efficiency: ", `Scoring_Efficiency/(min/gls)`, "min/gl"
  ))) +
  scale_fill_gradientn(colors = c("#54278f", "#756bb1", "#9e9ac8", "#bcbddc", "#dadaeb"), name = "Scoring Efficiency\n(min/gls)") +
  labs(
    title = "Forward Players: Playing Time and Scoring Efficiency",
    x = "Player",
    y = "Minutes Played"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle =15, hjust = 0.5, margin = margin(t = 5)), 
    legend.position = "right",
    legend.title = element_text(size = 10), 
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  )

# Plot for midfield players
midfield_plot <- ggplot(midfield_data, aes(x = reorder(Player, `Scoring_Efficiency/(min/gls)`), 
                                         y = Minutes_Played, 
                                         fill = `Scoring_Efficiency/(min/gls)`)) +
  geom_bar(stat = "identity", width = 0.5, aes(text = paste(
    "Player: ", Player, "<br>",
    "Transfer Fee:", `Transfer_Fee/m€`, "m€<br>",
    "Minutes Played: ", Minutes_Played, "min<br>",
    "Scoring Efficiency: ", `Scoring_Efficiency/(min/gls)`, "min/gl"
  ))) +
  scale_fill_gradientn(colors = c("#54278f", "#756bb1", "#9e9ac8", "#bcbddc", "#dadaeb"), name = "Scoring Efficiency\n(min/gls)") +
  labs(
    title = "Midfield Players: Playing Time and Scoring Efficiency",
    x = "Player",
    y = "Minutes Played"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle =15, hjust = 0.5, margin = margin(t = 5)), 
    legend.position = "right",
    legend.title = element_text(size = 10), 
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold")
  )

interactive_forward_plot <- ggplotly(forward_plot, tooltip = "text") 
interactive_midfield_plot <- ggplotly(midfield_plot, tooltip = "text") 

interactive_forward_plot
interactive_midfield_plot

6. Data and Output Storage

The raw data is stored separately in the “primary data” and “secondary data” folders. Some cleaned secondary data is also stored in the “secondary data” folder.

Tabular data obtained through five transformations is saved in the “derivative data” folder.

Two output charts are stored in the working directory.

library("htmlwidgets")
### * the following line of code were generated by ChatGPT: here I asked ChatGPT how to save the interactive charts
saveWidget(interactive_forward_plot, "forward_plot.html")
saveWidget(interactive_midfield_plot, "midfield_plot.html")
library(tidyverse)

rmd_file <- "/Users/zhangyuxin/Local/my472-at24-final-Yuxin-Zhang821/MY472-AT24-final-report.Rmd" # path to your Rmd file

read_file(rmd_file) %>% # read the file as a text file
  str_squish() %>% # remove all extra white space
  str_replace("^.+?output.+?[-]{3}", "") %>% # remove header
  str_replace_all("```[{].+?```", " ") %>% # remove code chunks
  str_replace_all("<![-].+?-->", " ") %>% # remove rmd comments
  str_replace_all("[!]?\\[.+\\][(].+[)]", " ") %>% # remove links
  str_replace_all("(^|\\s+)[^A-Za-z0-9]+", " ") %>% # remove symbols (1)
  str_replace_all("[^A-Za-z0-9]+($|\\s+)", " ") %>% # remove symbols (2)
  str_count("\\S+") %>% 
  paste("The document is", ., "words.") %>%
  print()